Abstract

We consider regression models involving multilayer perceptrons (MLP)
with one hidden layer and Gaussian noise. The data are assumed to be
generated by a true MLP model, and the parameters of the MLP are
estimated by maximizing the likelihood of the model. When the number of
hidden units of the true model is known, the asymptotic distributions of
the maximum likelihood estimator (MLE) and of the likelihood ratio (LR)
statistic are easy to compute, and the LR statistic converges to a
$\chi^2$ law. However, if the number of hidden units is over-estimated,
the Fisher information matrix of the model is singular and the asymptotic
behavior of the MLE is unknown. This paper deals with this case and gives
the exact asymptotic law of the LR statistic. Namely, if the parameters
of the MLP lie in a suitable compact set, we show that the LR statistic
converges to the supremum of the square of a Gaussian process indexed by
a class of limit score functions.

1 Introduction 

Feedforward neural networks are well-known and popular tools for dealing with
non-linear statistical models. We can describe an MLP as a parametric family of
probability density functions. If the noise of the regression model is Gaussian 
then it is well known that the maximum likelihood estimator is equal to the least- 
square estimator. Therefore, Gaussian likelihood is the usual assumption when 
we consider feedforward neural networks from a statistical viewpoint. White [9] 
reviews statistical properties of MLP estimation in detail. However he leaves 
an important question pending: the asymptotic behavior of the estimator when 
an MLP in use has redundant hidden units and the Fisher information matrix 
is singular. Amari, Park and Ozeki [1] give several examples of behavior of the 
LR in such cases. Fukumizu [4] shows that, for unbounded parameters, the LR
statistic can have an order lower bounded by $O(\log(n))$, with $n$ the number of
observations, instead of the classical convergence property to a $\chi^2$ law.

However, a fairly natural assumption is to consider that the parameters 
are bounded. Indeed, computer calculations always assume that numbers are 
bounded. Moreover a safe practice is to bound the parameters in order to 
avoid numerical problems. In such a context, different situations can occur. In
some cases, such as mixture models, the LR is tight and the calculation of the
asymptotic distribution is possible (see Liu and Shao [7]). In other cases, the
likelihood ratio may diverge even if the parameters are bounded; this is, for
example, the case in hidden Markov models (see Gassiat and Keribin [5]).
The behavior of the likelihood ratio for MLPs with bounded parameters
is therefore still an open question.

In this paper, we derive the distribution of the likelihood ratio if the param- 
eters are in a suitable compact set (i.e. bounded and closed). To obtain this 
result we use recent techniques introduced by Dacunha-Castelle and Gassiat [2] 
and Liu and Shao [7]. These techniques consist in finding a parameterization
separating the identifiable part and the unidentifiable part of the parameter
vector. We can then obtain an asymptotic expansion of the likelihood of the
model, which allows us to show that a set of generalized score functions is a
Donsker class and to find the asymptotic distribution of the LR statistic. The
paper is organized as follows. In Section 2 we state the model and the main
assumptions. Section 3 presents our main theorem and explains its meaning with
a brief summary and a statement of the significance of this work. In Section 4
we apply this theorem to the identification of the true architecture of the MLP
function. In Section 5 we show that MLP functions with sigmoidal transfer
functions satisfy the assumptions of this theorem. Finally, we prove the theorem
in the appendix.

2 The model 

We consider the regression model, for $i \in \mathbb{N}^*$:

$$Y_i = F_{\theta^0}(X_i) + \varepsilon_i \qquad (1)$$

where $X_i \in \mathbb{R}^d$ are observed exogenous variables and $Y_i$ is the variable to explain.
The data $(Y_i, X_i)$ are assumed to be generated by this true model. The noise
$(\varepsilon_i)_{i \in \mathbb{N}^*}$ is a sequence of independent and identically distributed (i.i.d.) $\mathcal{N}(0, \sigma^2)$
variables.

2.1 The regression function 

Let $x = (1, x_1, \dots, x_d)^T \in \mathbb{R}^{d+1}$ be the vector of inputs and $w_i := (w_{i0}, w_{i1}, \dots, w_{id})^T$.
The MLP function with $k$ hidden units can be written:

$$F_\theta(x) = \beta + \sum_{i=1}^{k} a_i \phi(w_i^T x),$$

with $\theta = (\beta, a_1, \dots, a_k, w_{10}, \dots, w_{1d}, \dots, w_{k0}, \dots, w_{kd}) \in \mathbb{R}^{k \times (d+2)+1}$ the
parameters of the model. The transfer function $\phi$ is assumed to be bounded and
three times differentiable. We also assume that the first, second and third
derivatives of the transfer function, $\phi'$, $\phi''$ and $\phi'''$, are bounded. In order to simplify
the presentation, we assume that the variance of the noise $\sigma^2$ is known. Note
that the true model (1) is assumed to be included in the considered set of
parameters $\Theta$. Let us define the true number of hidden units as the smallest
integer $k^0$ such that $\theta^0 = (\beta^0, a_1^0, \dots, a_{k^0}^0, w_{10}^0, \dots, w_{1d}^0, \dots, w_{k^0 0}^0, \dots, w_{k^0 d}^0)$ exists with
$F_{\theta^0}$ equal to the true regression function of model (1).

2.2 Parameterization of the model 

Let us write $\|\cdot\|$ for the Euclidean norm. Let us consider the variable $Z_i =
(X_i, Y_i)$, where $X_i$ and $Y_i$ follow the probability law induced by model (1).
We assume that the law of $X_i$ is $q(x)\lambda_d(x)$, with $\lambda_d$ the Lebesgue measure
on $\mathbb{R}^d$ and $q(x) > 0$ for all $x \in \mathbb{R}^d$. The likelihood of the observation $z := (x, y)$
for a parameter vector $\theta = (\beta, a_1, \dots, a_k, w_{10}, \dots, w_{1d}, \dots, w_{k0}, \dots, w_{kd})$ is
written:

$$f_\theta(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y - F_\theta(x))^2}\, q(x).$$

Let $\eta > 0$ be a small constant and $M$ a huge constant. The set of possible
parameters is

$$\Theta_k := \left\{ \theta = (\beta, a_1, \dots, a_k, w_{10}, \dots, w_{1d}, \dots, w_{k0}, \dots, w_{kd}) :\ \forall\, 1 \le i \le k,\ \|w_i\| \ge \eta,\ |a_i| \ge \eta \text{ and } \|\theta\| \le M \right\}.$$

Constraints on the parameter set. The constraint $\|w_i\| \ge \eta$ is introduced
in order to prevent a hidden unit from being constant like the bias $\beta$, instead
of being a function of $x$. The constraint $|a_i| \ge \eta$ forces the parameters of
the hidden units to converge to one of the parameter vectors $w_j^0$, $j \in \{1, \dots, k^0\}$,
when they maximize the likelihood. Finally, with the constraint $\|\theta\| \le M$, the
parameters are bounded and the set $\Theta_k$ is compact. Note that these constraints
are very easy to set in practice.
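
As an illustration, membership in the constrained parameter set can be checked numerically. The Python sketch below is not part of the original derivation: the function name `in_theta_k`, the flat layout of the parameter vector and the default values of `eta` and `M` are assumptions made for this example. The norm of each $w_i$ is taken over the whole vector $(w_{i0}, \dots, w_{id})$, as in the definition of Section 2.1.

```python
import numpy as np

def in_theta_k(theta, k, d, eta=1e-2, M=1e3):
    """Check the three constraints defining the parameter set Theta_k.

    theta is laid out as (beta, a_1..a_k, w_10..w_1d, ..., w_k0..w_kd).
    eta and M are illustrative values, not values taken from the paper.
    """
    theta = np.asarray(theta, dtype=float)
    a = theta[1:1 + k]
    w = theta[1 + k:].reshape(k, d + 1)      # row i is (w_i0, ..., w_id)
    return bool(
        np.all(np.linalg.norm(w, axis=1) >= eta)   # ||w_i|| >= eta
        and np.all(np.abs(a) >= eta)               # |a_i| >= eta
        and np.linalg.norm(theta) <= M             # ||theta|| <= M
    )

print(in_theta_k([0.5, 1.0, 0.2, -1.0, 0.7], k=1, d=2))
```

In a constrained optimizer, such a test (or the corresponding projection) would be applied at each step so that the maximization stays inside the compact set.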

The true density of the observation is denoted $f(z) := f_{\theta^0}(z)$. The main
goal of parametric statistics is to estimate the true parameter
$\theta^0$ from the observations $(z_1, \dots, z_n)$. This can be done by maximizing the
log-likelihood function:

$$l_n(\theta) := \sum_{i=1}^{n} \log f_\theta(z_i).$$

A parameter vector $\hat\theta_n$ realizing the maximum is called a Maximum
Likelihood Estimator (MLE). However, the MLE belongs to a submanifold of
non-zero dimension if the number of hidden units is overestimated. In the next section
we study the behavior of

$$\sup_{\theta \in \Theta_k} \sum_{i=1}^{n} \left( \log f_\theta(z_i) - \log f(z_i) \right), \quad \text{where } k > k^0,$$

which is the key to guessing the true architecture of the MLP model.
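
The model and its log-likelihood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `mlp` and `log_likelihood`, the choice of `tanh` as transfer function, the parameter values and the simulated data are all hypothetical, and the term $\sum_i \log q(x_i)$, which does not depend on $\theta$, is omitted.

```python
import numpy as np

def mlp(theta, x, k, phi=np.tanh):
    """F_theta(x) = beta + sum_i a_i * phi(w_i^T (1, x)) for each row of x."""
    d = x.shape[1]
    beta = theta[0]
    a = theta[1:1 + k]
    w = theta[1 + k:].reshape(k, d + 1)             # row i is (w_i0, ..., w_id)
    x1 = np.hstack([np.ones((x.shape[0], 1)), x])   # prepend the constant input 1
    return beta + phi(x1 @ w.T) @ a

def log_likelihood(theta, x, y, k, sigma=1.0):
    """Gaussian log-likelihood l_n(theta), without the sum of log q(x_i)."""
    resid = y - mlp(theta, x, k)
    n = y.shape[0]
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

# Simulated data from a hypothetical true model with k0 = 1 and d = 2.
rng = np.random.default_rng(0)
n, d, k0 = 200, 2, 1
theta0 = np.array([0.5, 1.0, 0.2, -1.0, 0.7])   # (beta, a_1, w_10, w_11, w_12)
x = rng.normal(size=(n, d))
y = mlp(theta0, x, k0) + rng.normal(size=n)
print(log_likelihood(theta0, x, y, k0))
```

Maximizing `log_likelihood` over a constrained parameter set, for instance with a generic numerical optimizer, would give a maximum likelihood estimate; the optimizer itself is left out of this sketch.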



3 Asymptotic distribution of the LR statistic 

We will use the abbreviation $Pg = \int g\, dP$ for an integrable function $g$ and
a measure $P$. We define the $L^2(P)$ norm as $\|g\|_2 = \sqrt{Pg^2}$ and the map
$\Omega : L^2(P) \to L^2(P)$ as $\Omega(g) = \frac{g}{\|g\|_2}$ if $g \ne 0$. The maximum of the log-likelihood ratio
is denoted:

$$\lambda_n = \sup_{\theta \in \Theta_k} \sum_{i=1}^{n} \left( \log f_\theta(Z_i) - \log f(Z_i) \right).$$

Finally, let us write $e(z) := \frac{1}{\sigma^2} \left( y - \left( \beta^0 + \sum_{i=1}^{k^0} a_i^0 \phi(w_i^{0T} x) \right) \right)$.
In what follows, we assume the following properties:

H-1 : the parameters realizing the true regression function $F_{\theta^0}$ lie in the
interior of $\Theta_k$.

H-2 : Let $k$ be an integer greater than or equal to $k^0$ and

$\theta = (\beta, a_1, \dots, a_k, w_{10}, \dots, w_{1d}, \dots, w_{k0}, \dots, w_{kd})$.
The model is identifiable in the following weak sense:

$$F_\theta = F_{\theta^0} \iff \beta = \beta^0 \text{ and } \sum_{i=1}^{k^0} a_i^0 \delta_{w_i^0} = \sum_{i=1}^{k} a_i \delta_{w_i}.$$

Note that some new constraints on the parameters may have
to be set to fulfill this assumption. For example, if the transfer function
is the hyperbolic tangent (or any odd function), the constraints on the
parameters $a_i$ will be $a_i \ge \eta$, in order to avoid a symmetry on the sign
(because $\tanh(-t) = -\tanh(t)$).

H-3 : $E(\|X\|^6) < \infty$.

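H-4 : the functions of the set

$$\left( 1,\ \left( x_k x_l \phi''(w_i^{0T} x) \right)_{1 \le l \le k \le d,\ 1 \le i \le k^0},\ \left( \phi''(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( x_k \phi'(w_i^{0T} x) \right)_{1 \le k \le d,\ 1 \le i \le k^0},\ \left( \phi'(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( \phi(w_i^{0T} x) \right)_{1 \le i \le k^0} \right)$$

are linearly independent in the Hilbert space $L^2(q\lambda_d)$.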
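
The weak identifiability of H-2 and the sign symmetry mentioned for odd transfer functions can be checked numerically. The following Python sketch uses hypothetical parameter values and a hypothetical helper `mlp`; it is an illustration, not part of the proofs. It shows that splitting the weight $a_1^0$ over two hidden units sharing the same $w$ leaves the regression function unchanged, which is exactly the situation where the measure $\sum_i a_i \delta_{w_i}$ is the same for both parameter vectors.

```python
import numpy as np

def mlp(beta, a, w, x):
    """One-hidden-layer tanh MLP; rows of w are (w_i0, ..., w_id)."""
    x1 = np.hstack([np.ones((x.shape[0], 1)), x])
    return beta + np.tanh(x1 @ w.T) @ a

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))

# Hypothetical true model with k0 = 1 hidden unit.
beta0, a0, w0 = 0.5, np.array([1.0]), np.array([[0.2, -1.0, 0.7]])

# Over-parameterized model with k = 2: a_1^0 is split into q and 1 - q.
gaps = []
for q in (0.1, 0.5, 0.9):
    a = np.array([q * a0[0], (1 - q) * a0[0]])
    w = np.vstack([w0, w0])
    gaps.append(np.max(np.abs(mlp(beta0, a, w, x) - mlp(beta0, a0, w0, x))))

# Sign symmetry of tanh: (a, w) and (-a, -w) define the same function.
sym_gap = np.max(np.abs(mlp(beta0, -a0, -w0, x) - mlp(beta0, a0, w0, x)))
print(gaps, sym_gap)
```

The whole family indexed by `q` realizes the true regression function, so the true parameter is not unique in the over-parameterized model; the constraint $a_i \ge \eta$ removes only the sign symmetry.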
We get then the following result: 
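Theorem 3.1 Under the assumptions H-1, H-2 and H-3, a centered Gaussian
process $\{W_S, S \in \mathbb{F}^k\}$ with continuous sample paths and covariance kernel
$P(W_{S_1} W_{S_2}) = P(S_1 S_2)$ exists so that

$$\lim_{n \to \infty} 2\lambda_n = \sup_{S \in \mathbb{F}^k} \left( \max(W_S, 0) \right)^2.$$

The index set $\mathbb{F}^k$ is defined as $\mathbb{F}^k = \cup_t \mathbb{F}_t^k$, where the union runs over all possible
$t = (t_0, \dots, t_{k^0}) \in \mathbb{N}^{k^0+1}$ with $0 = t_0 < t_1 < \dots < t_{k^0} \le k$ and

$$\begin{aligned}
\mathbb{F}_t^k = \Big\{ \Omega\Big( & \gamma e(z) + \sum_{i=1}^{k^0} \epsilon_i e(z) \phi(w_i^{0T} x) + \sum_{i=1}^{k^0} e(z) \phi'(w_i^{0T} x) \zeta_i^T x \\
& + \sum_{i=1}^{k^0} e(z)\, \mathrm{sg}(a_i^0)\, \phi''(w_i^{0T} x) \Big( \delta(i) \sum_{j=t_{i-1}+1}^{t_i} \nu_j^T x x^T \nu_j \Big) \Big), \\
& \gamma, \epsilon_1, \dots, \epsilon_{k^0} \in \mathbb{R};\ \zeta_1, \dots, \zeta_{k^0}, \nu_1, \dots, \nu_{t_{k^0}} \in \mathbb{R}^{d+1} \Big\},
\end{aligned}$$

where $\delta(i) = 1$ if a vector $q$ exists so that $\sum_{j=t_{i-1}+1}^{t_i} q_j = 1$ and $\sum_{j=t_{i-1}+1}^{t_i} \sqrt{q_j}\, \nu_j = 0$,
otherwise $\delta(i) = 0$. The function $\mathrm{sg}$ is defined by $\mathrm{sg}(x) = 1$ if $x > 0$ and
$\mathrm{sg}(x) = -1$ if $x < 0$.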

This theorem is proved in the appendix. Note that this theorem proves that
the LR statistic is tight, so penalized likelihood yields a consistent method to
identify the minimal architecture of the true model.
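
To give a feeling for the limit law, the supremum can be approximated by Monte Carlo over a finite family of index functions. The sketch below is only a crude illustration under stated assumptions: it replaces the infinite index set by four hypothetical score functions for $k^0 = 1$, a `tanh` transfer function and $\sigma = 1$, so it gives at best a lower approximation of the supremum in the theorem. The function name `simulate_sup_stat` and all numerical values are assumptions of this example.

```python
import numpy as np

def simulate_sup_stat(scores, n_draws=2000, seed=0):
    """Draw from max_j max(W_j, 0)^2 for a finite family of score functions.

    scores: array of shape (n, m); column j holds S_j(z_1), ..., S_j(z_n).
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    s = scores - scores.mean(axis=0)           # centre each column
    s = s / np.sqrt(np.mean(s**2, axis=0))     # empirical L2 norm equal to 1
    # W = G s / sqrt(n), with G standard normal, has covariance s^T s / n,
    # the empirical counterpart of the kernel P(S_1 S_2).
    w = rng.normal(size=(n_draws, n)) @ s / np.sqrt(n)
    return np.max(np.maximum(w, 0.0) ** 2, axis=1)

# Toy index functions (hypothetical values of the true parameters).
rng = np.random.default_rng(1)
n = 500
x1 = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
e = rng.normal(size=n)                          # e(z) under the true model
t = np.tanh(x1 @ np.array([0.2, -1.0, 0.7]))    # phi(w^0T x)
d1 = 1.0 - t**2                                 # phi'(w^0T x)
d2 = -2.0 * t * d1                              # phi''(w^0T x)
scores = np.column_stack([e, e * t, e * d1 * x1[:, 1], e * d2 * x1[:, 1] ** 2])
draws = simulate_sup_stat(scores)
print(np.quantile(draws, 0.95))
```

With a single index function the simulated variable is the squared positive part of a standard Gaussian variable, whose mean is 1/2; adding index functions can only increase the supremum.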

4 Identification of the architecture of the MLP 

The point is to guess the number of hidden units $k^0$ of the true MLP function.
If $k^0$ is known, the information matrix will be regular (see Fukumizu (3)) and
pruning of useless parameters will be easy with classical statistical methods, as
in Cottrell et al (2). Here, we assume that the possible number of hidden units
in the MLP function is bounded by a large number $K$. So the set of possible
parameters will be $\Theta = \cup_{k=1}^{K} \Theta_k$.

Note that the log-likelihood of the model, $l_n(\theta) := \sum_{i=1}^{n} \log(f_\theta(z_i))$, is known
up to the constant $\sum_{i=1}^{n} \log(q(x_i))$, which is independent of the parameter $\theta$. We define $\hat k$,
the maximum penalized likelihood estimator, as the number of hidden units
maximizing:

$$T_n(k) := \max\{ l_n(\theta) : \theta \in \Theta_k \} - p_n(k) \qquad (2)$$

where $p_n(k)$ is a term which penalizes the log-likelihood as a function of the
number of hidden units of the model.

Let $p_n(\cdot)$ be an increasing sequence so that $p_n(k_1) - p_n(k_2) \to \infty$ for all
$k_1 > k_2$ and $\lim_{n \to \infty} \frac{p_n(k)}{n} = 0$. Note that such conditions are satisfied by
BIC-like criteria.
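
The selection rule of equation (2) can be sketched as follows. This is an illustration only: the function name `select_k`, the BIC-like penalty $p_n(k) = \frac{1}{2}(k(d+2)+1)\log n$ and the maximized log-likelihood values are assumptions of this example, not results from the paper. The chosen penalty satisfies both conditions above, since the difference between two penalties grows like $\log n$ and $p_n(k)/n$ tends to 0.

```python
import numpy as np

def select_k(max_loglik, n, d):
    """Return the k maximizing T_n(k) = max l_n over Theta_k minus p_n(k).

    max_loglik[k - 1] is the maximized log-likelihood with k hidden units.
    """
    ks = np.arange(1, len(max_loglik) + 1)
    n_params = ks * (d + 2) + 1                 # dimension of Theta_k
    penalty = 0.5 * n_params * np.log(n)        # BIC-like penalty p_n(k)
    t = np.asarray(max_loglik, dtype=float) - penalty
    return int(ks[np.argmax(t)]), t

# Hypothetical maximized log-likelihoods for k = 1, ..., 4 hidden units.
max_loglik = [-320.0, -281.5, -280.9, -280.6]
k_hat, t_values = select_k(max_loglik, n=200, d=2)
print(k_hat, t_values)
```

In this made-up example the gain in log-likelihood beyond two hidden units is smaller than the increase of the penalty, so two hidden units are selected.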

We get then the following result: 

Theorem 4.1 If the assumptions H-1, H-2, H-3 and H-4 are true then $\hat k \xrightarrow{P} k^0$.
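The proof is an adaptation of the proof of Theorem 2.1 of Gassiat [6]. Let us
write $l_n(f)$ for the log-likelihood of the true MLP model. For any $f_\theta$, $\theta \in \Theta$, let

$$s_\theta(z) := \frac{\frac{f_\theta}{f}(z) - 1}{\left\| \frac{f_\theta}{f} - 1 \right\|_2}, \quad \text{where } \|\cdot\|_2 \text{ is the } L^2(f\lambda_{d+1}) \text{ norm},$$

be the generalized score function. Then, it is easy to see that under
assumptions H-1, H-2, H-3 and H-4 the conditions A1 and A2 of Gassiat (6) are
fulfilled. Hence we get inequality 1.2 of Gassiat (6):

$$\sup_{\theta \in \Theta} \left( l_n(\theta) - l_n(f) \right) \le \frac{1}{2} \sup_{\theta \in \Theta} \frac{\left( \sum_{i=1}^{n} s_\theta(z_i) \right)^2}{\sum_{i=1}^{n} (s_\theta)_-^2(z_i)},$$

where $(s_\theta)_-(z) = -\min\{0, s_\theta(z)\}$.
Now,

$$\begin{aligned}
P(\hat k > k^0) & \le \sum_{k=k^0+1}^{K} P\left( T_n(k) > T_n(k^0) \right) \\
& = \sum_{k=k^0+1}^{K} P\left( \sup_{\theta \in \Theta_k} (l_n(\theta) - l_n(f)) - \sup_{\theta \in \Theta_{k^0}} (l_n(\theta) - l_n(f)) > p_n(k) - p_n(k^0) \right) \\
& \le \sum_{k=k^0+1}^{K} P\left( \sup_{\theta \in \Theta_k} \frac{\left( \sum_{i=1}^{n} s_\theta(z_i) \right)^2}{2 \sum_{i=1}^{n} (s_\theta)_-^2(z_i)} > p_n(k) - p_n(k^0) \right).
\end{aligned}$$

Now, by Gassiat (6):

$$\sup_{\theta \in \Theta_k} \frac{\left( \sum_{i=1}^{n} s_\theta(z_i) \right)^2}{\sum_{i=1}^{n} (s_\theta)_-^2(z_i)} = O_P(1),$$

where $O_P(1)$ means bounded in probability, and

$$P(\hat k > k^0) \xrightarrow[n \to \infty]{} 0.$$

In the same way

$$P(\hat k < k^0) \le \sum_{k=1}^{k^0-1} P\left( \sup_{\theta \in \Theta_k} \frac{l_n(\theta) - l_n(f)}{n} > \frac{p_n(k) - p_n(k^0)}{n} \right).$$

But the set $\{\log(f_\theta), \theta \in \Theta\}$ is Glivenko-Cantelli, so that $\sup_{\theta \in \Theta_k} \frac{l_n(\theta) - l_n(f)}{n}$
converges in probability to

$$- \inf_{\theta \in \Theta_k} \int f \log \frac{f}{f_\theta} < 0.$$

Finally

$$P(\hat k < k^0) \xrightarrow[n \to \infty]{} 0.$$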
5 Application to sigmoidal transfer functions

In this section, the assumptions H-2 and H-4 are verified for sigmoidal
transfer functions:

$$\phi(t) = \frac{1}{1 + e^{-t}}.$$

Assumption H-2 has been shown for hyperbolic tangent functions by
Sussmann (8) with the additional constraint $a_i \ge \eta$. Moreover, MLPs with sigmoidal
transfer functions and MLPs with hyperbolic tangent transfer functions are equivalent,
because a one-to-one correspondence between the two kinds of MLP exists, as
$\frac{1}{1 + e^{-t}} = (1 + \tanh(t/2))/2$ (see Fukumizu (3)). Hence assumption H-2 is
verified for sigmoidal functions with the additional constraint.
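
The correspondence between the two kinds of MLP can be checked numerically. The Python sketch below is an illustration with hypothetical weights: it verifies the identity between the sigmoid and the hyperbolic tangent on a grid, then rewrites a small sigmoid MLP as a tanh MLP by halving the output weights and the pre-activations and shifting the bias.

```python
import numpy as np

t = np.linspace(-10.0, 10.0, 2001)
sigmoid = 1.0 / (1.0 + np.exp(-t))
max_gap = np.max(np.abs(sigmoid - (1.0 + np.tanh(t / 2.0)) / 2.0))

# Hypothetical sigmoid MLP with two hidden units and pre-activations u_i(t).
beta, a = 0.3, np.array([1.5, -0.8])
u = np.stack([0.7 * t - 0.2, -1.1 * t + 0.4])       # shape (2, len(t))
f_sigmoid = beta + a @ (1.0 / (1.0 + np.exp(-u)))

# Equivalent tanh MLP: bias beta + sum(a)/2, output weights a/2, inputs u/2.
f_tanh = (beta + a.sum() / 2.0) + (a / 2.0) @ np.tanh(u / 2.0)
mlp_gap = np.max(np.abs(f_sigmoid - f_tanh))
print(max_gap, mlp_gap)
```

Both gaps are at the level of floating-point rounding, which is the numerical counterpart of the one-to-one correspondence used in the text.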

The main point is to verify H-4. The proof uses an extension of the result of
Fukumizu [3].

We define the complex sigmoidal function on $\mathbb{C}$ by $\phi(z) = \frac{1}{1 + e^{-z}}$.
The singularities of $\phi$ are:

$$\{ z \in \mathbb{C} \mid z = (2n+1)\pi\sqrt{-1},\ n \in \mathbb{Z} \},$$

all of which are poles of order 1. Next we review two fundamental propositions in
complex analysis.

Proposition 5.1 Let $\phi$ be a holomorphic function on a connected open set $D$
in $\mathbb{C}$ and $p$ be a point in $D$. If a sequence $\{p_n\}_{n=1}^{\infty}$ exists in $D$ so that $p_n \ne p$,
$\lim_{n \to \infty} p_n = p$ and $\phi(p_n) = 0$ for all $n \in \mathbb{N}$, then $\phi(z) = 0$ for all $z \in D$.

Proposition 5.2 Let $\phi$ be a holomorphic function on $D \setminus \{p\}$, where $D$ is a connected open set
in $\mathbb{C}$ and $p$ is a point in $D$. Then the following equivalence relations hold:

• $p$ is a removable singularity $\iff \lim_{z \to p} \phi(z) \in \mathbb{C}$;

• $p$ is a pole $\iff \lim_{z \to p} |\phi(z)| = \infty$;

• $p$ is an essential singularity $\iff \lim_{z \to p} |\phi(z)|$ does not exist.
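
The pole structure of the complex sigmoidal function can be observed numerically. The following Python sketch is an illustration only; the helper names `phi` and `dphi` are assumptions of this example. Near the singularity $p = \pi\sqrt{-1}$, the modulus of $\phi$ blows up, while $(z - p)\phi(z)$ and $(z - p)^2 \phi'(z)$ stay bounded, as expected for poles of order 1 and 2 respectively.

```python
import numpy as np

def phi(z):
    """Complex sigmoidal function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def dphi(z):
    """Derivative of phi, using phi' = phi * (1 - phi)."""
    return phi(z) * (1.0 - phi(z))

p = 1j * np.pi                          # singularity obtained for n = 0
for r in (1e-1, 1e-2, 1e-3):
    z = p + r                           # approach p along the real direction
    print(r, abs(phi(z)), (z - p) * phi(z), (z - p) ** 2 * dphi(z))
```

As `r` decreases, the second column grows like `1 / r`, while the last two columns approach 1 and -1.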

Let $w_1, \dots, w_{k^0}$ be the parameters of the minimal, true MLP function (in
order to simplify the notation, the superscript "0" is omitted). By Lemma 3
of Fukumizu (3), a basis $(x^{(1)}, \dots, x^{(d)})$ of $\mathbb{R}^d$ exists so that

1. For all $i \in \{1, \dots, k^0\}$ and all $h \in \{1, \dots, d\}$:

$$\sum_{j=1}^{d} w_{ij} x_j^{(h)} \ne 0.$$

2. For all $i_1, i_2 \in \{1, \dots, k^0\}$, $i_1 \ne i_2$, and all $h \in \{1, \dots, d\}$:

$$w_{i_1 0} + \sum_{j=1}^{d} w_{i_1 j} x_j^{(h)} \ne \pm \left( w_{i_2 0} + \sum_{j=1}^{d} w_{i_2 j} x_j^{(h)} \right).$$

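For $h$, $1 \le h \le d$, and $i \in \{1, \dots, k^0\}$, let $m_i^{(h)} := \sum_{j=1}^{d} w_{ij} x_j^{(h)}$. We fix $l$ for
a while. We set

$$S_i^{(l)} := \left\{ u \in \mathbb{C} \ \middle|\ u = \frac{(2n+1)\pi\sqrt{-1} - w_{i0}}{m_i^{(l)}},\ n \in \mathbb{Z} \right\}.$$

Clearly the points in $S_i^{(l)}$ are the singularities of $\phi(m_i^{(l)} u + w_{i0})$. Note that
these points are poles of order 1 for

$$\phi(m_i^{(l)} u + w_{i0}) = \frac{1}{1 + e^{-(m_i^{(l)} u + w_{i0})}},$$

of order 2 for

$$\phi'(m_i^{(l)} u + w_{i0}) = \frac{e^{-(m_i^{(l)} u + w_{i0})}}{\left( 1 + e^{-(m_i^{(l)} u + w_{i0})} \right)^2},$$

and of order 3 for

$$\phi''(m_i^{(l)} u + w_{i0}) = -\frac{e^{-(m_i^{(l)} u + w_{i0})}}{\left( 1 + e^{-(m_i^{(l)} u + w_{i0})} \right)^2} + \frac{2 e^{-2(m_i^{(l)} u + w_{i0})}}{\left( 1 + e^{-(m_i^{(l)} u + w_{i0})} \right)^3}.$$

Let $D^{(l)} := \mathbb{C} - \cup_{1 \le i \le k^0} S_i^{(l)}$. Holomorphic functions on $D^{(l)}$ are defined as
follows:

$$\begin{aligned}
\Psi^{(l)}(u) := {} & \alpha_0 + \sum_{i=1}^{k^0} \alpha_i \phi(m_i^{(l)} u + w_{i0}) + \sum_{i=1}^{k^0} \epsilon_i \phi'(m_i^{(l)} u + w_{i0}) \\
& + \sum_{i=1}^{k^0} \sum_{j=1}^{d} \beta_{ij} \phi'(m_i^{(l)} u + w_{i0}) x_j^{(l)} u + \sum_{i=1}^{k^0} \delta_i \phi''(m_i^{(l)} u + w_{i0}) \\
& + \sum_{i=1}^{k^0} \sum_{j,k=1,\ j \le k}^{d} \gamma_{ijk} \phi''(m_i^{(l)} u + w_{i0}) x_j^{(l)} x_k^{(l)} u^2.
\end{aligned}$$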

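The functions in the set

$$\left( 1,\ \left( x_k x_l \phi''(w_i^{0T} x) \right)_{1 \le l \le k \le d,\ 1 \le i \le k^0},\ \left( \phi''(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( x_k \phi'(w_i^{0T} x) \right)_{1 \le k \le d,\ 1 \le i \le k^0},\ \left( \phi'(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( \phi(w_i^{0T} x) \right)_{1 \le i \le k^0} \right)$$

are linearly independent if the following property is verified:

$$\forall u \in D^{(l)},\ \Psi^{(l)}(u) = 0 \iff \alpha_i,\ \epsilon_i,\ \beta_{ij},\ \delta_i \text{ and } \gamma_{ijk} \text{ are equal to } 0.$$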
Let us assume that, for all $u \in D^{(l)}$, $\Psi^{(l)}(u) = 0$. Then, by Proposition 5.2, all the points
in $S_i^{(l)}$ are removable singularities of $\Psi^{(l)}$.
Let us write

$$p_i^{(l)} := \frac{\pi\sqrt{-1} - w_{i0}}{m_i^{(l)}}.$$

Clearly, for $1 \le i \le k^0 - 1$, $p_{k^0}^{(l)} \notin S_i^{(l)}$, because for all $i_1, i_2 \in \{1, \dots, k^0\}$,
$i_1 \ne i_2$, and all $h \in \{1, \dots, d\}$

$$w_{i_1 0} + \sum_{j=1}^{d} w_{i_1 j} x_j^{(h)} \ne \pm \left( w_{i_2 0} + \sum_{j=1}^{d} w_{i_2 j} x_j^{(h)} \right).$$

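So, $\Psi^{(l)}(u)$ can be written as:

$$\begin{aligned}
\Psi^{(l)}(u) = {} & \alpha_{k^0} \phi(m_{k^0}^{(l)} u + w_{k^0 0}) + \Big( \sum_{j=1}^{d} \beta_{k^0 j} x_j^{(l)} u + \epsilon_{k^0} \Big) \phi'(m_{k^0}^{(l)} u + w_{k^0 0}) \\
& + \Big( \sum_{j,k=1,\ j \le k}^{d} \gamma_{k^0 jk} x_j^{(l)} x_k^{(l)} u^2 + \delta_{k^0} \Big) \phi''(m_{k^0}^{(l)} u + w_{k^0 0}) + \Psi_{k^0-1}(u),
\end{aligned}$$

where

$$\begin{aligned}
\Psi_{k^0-1}(u) := {} & \alpha_0 + \sum_{i=1}^{k^0-1} \alpha_i \phi(m_i^{(l)} u + w_{i0}) + \sum_{i=1}^{k^0-1} \epsilon_i \phi'(m_i^{(l)} u + w_{i0}) + \sum_{i=1}^{k^0-1} \sum_{j=1}^{d} \beta_{ij} \phi'(m_i^{(l)} u + w_{i0}) x_j^{(l)} u \\
& + \sum_{i=1}^{k^0-1} \delta_i \phi''(m_i^{(l)} u + w_{i0}) + \sum_{i=1}^{k^0-1} \sum_{j,k=1,\ j \le k}^{d} \gamma_{ijk} \phi''(m_i^{(l)} u + w_{i0}) x_j^{(l)} x_k^{(l)} u^2.
\end{aligned}$$

The point $p_{k^0}^{(l)}$ is a regular point of $\Psi_{k^0-1}(u)$, while $\phi(m_{k^0}^{(l)} u + w_{k^0 0})$ has a pole of
order 1 at $p_{k^0}^{(l)}$, $\phi'(m_{k^0}^{(l)} u + w_{k^0 0})$ has a pole of order 2 at $p_{k^0}^{(l)}$ and $\phi''(m_{k^0}^{(l)} u + w_{k^0 0})$
has a pole of order 3 at $p_{k^0}^{(l)}$. Since $p_{k^0}^{(l)}$ is a removable singularity of $\Psi^{(l)}(u)$, we
have:

$$\alpha_{k^0} = 0,\quad \epsilon_{k^0} = 0,\quad \sum_{j=1}^{d} \beta_{k^0 j} x_j^{(l)} = 0 \quad \text{and} \quad \sum_{j,k=1,\ j \le k}^{d} \gamma_{k^0 jk} x_j^{(l)} x_k^{(l)} = \delta_{k^0} = 0.$$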

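As a result, $\Psi^{(l)}(u) = \Psi_{k^0-1}(u)$. Applying the same argument successively to
$p_{k^0-1}^{(l)}, \dots, p_1^{(l)}$, we finally obtain, for all $1 \le i \le k^0$:

$$\alpha_i = 0,\quad \epsilon_i = 0,\quad \sum_{j=1}^{d} \beta_{ij} x_j^{(l)} = 0,\quad \sum_{j,k=1,\ j \le k}^{d} \gamma_{ijk} x_j^{(l)} x_k^{(l)} = 0,\quad \delta_i = 0,\quad \alpha_0 = 0.$$

Since $(x^{(1)}, \dots, x^{(d)})$ form a basis of $\mathbb{R}^d$, we have $\beta_{ij} = 0$ for all $1 \le i \le k^0$,
$1 \le j \le d$.
For $\gamma_{ijk}$, we get:

$$\sum_{j,k=1,\ j \le k}^{d} \gamma_{ijk} x_j^{(l)} x_k^{(l)} = \sum_{k=1}^{d} \Big( \sum_{j=1}^{k} \gamma_{ijk} x_j^{(l)} \Big) x_k^{(l)} = 0$$

and, since $(x^{(1)}, \dots, x^{(d)})$ form a basis of $\mathbb{R}^d$, we obtain, for all $l \in \{1, \dots, d\}$:

$$\sum_{j=1}^{k} \gamma_{ijk} x_j^{(l)} = 0$$

and, by the same remark on $(x^{(1)}, \dots, x^{(d)})$, $\gamma_{ijk} = 0$ for all $1 \le i \le k^0$, $1 \le j \le
k \le d$.

This proves that H-4 holds for sigmoidal functions with the additional
constraints $\forall i \in \{1, \dots, k^0\}$, $a_i \ge \eta$. ∎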

6 Conclusion 

We have computed the asymptotic distribution of the LR statistic for parametric
MLP regression. This theorem can be applied to the most widely used transfer
functions for MLP: the sigmoidal functions. Note that the results assume some
constraints on the parameters of the MLP. The constraints on $a_i$ and $w_i$ may
be relaxed, but a more clever reparameterization and a higher order in the
expansion of the LR statistic would certainly be required. The assumption
on $\sigma^2$ is certainly easier to remove; however, the expansion of Lemma 7.1
would be much more complicated, and so would the limit score functions. Finally, this
theorem shows that the LR statistic is tight, so information criteria such as the
Bayesian information criterion (BIC) will be consistent in the sense that they will
select the model with the true dimension $k^0$ with probability tending to 1 as the number of
observations goes to infinity. This is the main practical application of the results
obtained in the paper.



7 Appendix: Proof of the Theorem.

Let

$$s_\theta(z) := \frac{\frac{f_\theta}{f}(z) - 1}{\left\| \frac{f_\theta}{f} - 1 \right\|_2}, \quad \text{where } \|\cdot\|_2 \text{ is the } L^2(f\lambda_{d+1}) \text{ norm},$$

be the generalized score functions. First, we derive an asymptotic expansion
of the generalized score when the model is over-parameterized. We
reparameterize the model using the same method as Liu and Shao [7] for
mixture models.
7.1 Reparameterization.

If $\frac{f_\theta}{f} - 1 = 0$, we have $\beta = \beta^0$ and a vector $t = (t_i)_{1 \le i \le k^0}$ exists so that
$0 = t_0 < t_1 < \dots < t_{k^0} \le k$ and, up to permutations, we have $w_{t_{i-1}+1} = \dots =
w_{t_i} = w_i^0$ and $\sum_{j=t_{i-1}+1}^{t_i} a_j = a_i^0$. Let $s_i = \sum_{j=t_{i-1}+1}^{t_i} a_j - a_i^0$ and $q_j = \frac{a_j}{\sum_{j=t_{i-1}+1}^{t_i} a_j}$
if $\sum_{j=t_{i-1}+1}^{t_i} a_j \ne 0$, and otherwise $q_j = 0$. We then get the reparameterization
$\theta = (\Phi_t, \psi_t)$ with



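With this parameterization, for a fixed $t$, $\Phi_t$ is an identifiable parameter and all
the non-identifiability of the model will be in $\psi_t$. Then $\frac{f_\theta}{f}(z)$ is equal to

$$\frac{\exp\left( -\frac{1}{2\sigma^2} \left( y - \left( \beta + \sum_{i=1}^{k^0} (s_i + a_i^0) \sum_{j=t_{i-1}+1}^{t_i} q_j \phi(w_j^T x) \right) \right)^2 \right)}{\exp\left( -\frac{1}{2\sigma^2} \left( y - \left( \beta^0 + \sum_{i=1}^{k^0} a_i^0 \phi(w_i^{0T} x) \right) \right)^2 \right)}.$$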




Now, as the third derivative of the transfer function is bounded and thanks to
assumption H-3, the third-order derivatives of the function $\frac{f_\theta}{f}(z)$ with respect
to the components of $\Phi_t$ are dominated by a square integrable function,
because there exists a constant $C$ so that we have the following inequalities:

Vfc, Mi e >io, • • • , w kd } , sup ll ^gjj l < C(l + ||X|| 3 ). 

So, by the Taylor formula with an integral remainder around the identifiable 
parameter $° with 

$? = (/3°, «;?,■■■,«;? wgo,---,tugo 

S v ' * v ' 

tl i fe o — t k o_ 1 

we get the following Taylor expansion for the likelihood ratio : 

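Lemma 7.1 For a fixed $t$, let us write $D(\Phi_t, \psi_t) := \left\| \frac{f_{(\Phi_t, \psi_t)}}{f} - 1 \right\|_2$. In the
neighborhood of the identifiable parameter $\Phi_t^0$, we get the following
approximation:

$$\frac{f_\theta}{f}(z) = 1 + (\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z) + 0.5\, (\Phi_t - \Phi_t^0)^T f''_{(\Phi_t^0, \psi_t)}(z) (\Phi_t - \Phi_t^0) + o(D(\Phi_t, \psi_t)),$$

with

$$(\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z) = e(z) \left( \beta - \beta^0 + \sum_{i=1}^{k^0} s_i \phi(w_i^{0T} x) + \sum_{i=1}^{k^0} \sum_{j=t_{i-1}+1}^{t_i} q_j (w_j - w_i^0)^T x\, a_i^0 \phi'(w_i^{0T} x) \right)$$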
and



(**-*?)%o iW W(**-*?) = 

(i - ( (<f>t - *?)%^*>(*) W> T <*)<*« - *?)) 

+ Y.T=iY%= ti - 1+ i(<lj w j ~ w i) Txs i<f>' ( w f x )) . 



Proof of the lemma. This expansion, obtained by a straightforward
calculation of the derivatives of $\frac{f_\theta}{f}(z)$ with respect to the components of $\Phi_t$ up to
the second order, is postponed to the end of this appendix.
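
The first-order derivatives of the likelihood ratio can be checked by finite differences. The Python sketch below is an illustration under simplifying assumptions: it takes $k = k^0 = 1$ (so that $q_1 = 1$ and $s_1 = a_1 - a_1^0$), $\sigma = 1$ and a `tanh` transfer function, with hypothetical parameter values and a single hypothetical observation. In that special case the gradient of the ratio at the true parameter should be $e(z)\,(1,\ \phi(w^{0T}x),\ a^0 \phi'(w^{0T}x)\, x^T)$.

```python
import numpy as np

def ratio(theta, theta0, x1, y):
    """f_theta(z) / f_theta0(z) for one observation z = (x, y), with sigma = 1."""
    def F(th):
        beta, a, w = th[0], th[1], th[2:]
        return beta + a * np.tanh(w @ x1)
    return np.exp(-0.5 * (y - F(theta)) ** 2 + 0.5 * (y - F(theta0)) ** 2)

theta0 = np.array([0.5, 1.0, 0.2, -1.0, 0.7])   # (beta, a_1, w_10, w_11, w_12)
x1 = np.array([1.0, 0.3, -0.4])                 # (1, x_1, x_2)
y = 1.2
u = theta0[2:] @ x1
e = y - (theta0[0] + theta0[1] * np.tanh(u))    # e(z) when sigma = 1

# Analytic gradient at theta0 with respect to (beta, a_1, w_1).
analytic = np.concatenate((
    [e, e * np.tanh(u)],
    e * theta0[1] * (1.0 - np.tanh(u) ** 2) * x1,
))

# Central finite differences along each coordinate direction.
h = 1e-6
numeric = np.array([
    (ratio(theta0 + h * b, theta0, x1, y) - ratio(theta0 - h * b, theta0, x1, y)) / (2 * h)
    for b in np.eye(5)
])
print(np.max(np.abs(numeric - analytic)))
```

The same finite-difference check could be applied to the second-order derivatives, at the cost of a slightly larger step `h`.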

Now, the convergence to a Gaussian process will be derived from the Donsker
property of the set of generalized score functions $\mathbb{S} = \{ s_\theta(z), \theta \in \Theta_k \}$. Let an
$\varepsilon$-bracket $[l, u]$ be a set of functions $h$ with $l \le h \le u$ and $\sqrt{P(l - u)^2} <
\varepsilon$. The bracketing number $N_{[\,]}(\varepsilon, \mathbb{S}, L^2(f\lambda_{d+1}))$ is the minimum number of
$\varepsilon$-brackets needed to cover $\mathbb{S}$. The entropy with bracketing is the logarithm of the
bracketing number. It is well known (see van der Vaart [8]) that the class of
functions $\mathbb{S}$ will be Donsker if its entropy with bracketing grows at a slower
rate than $\frac{1}{\varepsilon^2}$. A sufficient condition for the Donsker property is then that the
bracketing number grows as a polynomial function of $\frac{1}{\varepsilon}$.



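7.2 Polynomial bound for the growth of the bracketing number.

Let us write $D(\theta) := \left\| \frac{f_\theta}{f} - 1 \right\|_2$. For all $\varepsilon > 0$, the set of parameters can be
divided into two sets, $\mathbb{S}_\varepsilon$ and its complement, with

$$\mathbb{S}_\varepsilon = \{ \theta \in \Theta_k \text{ so that } D(\theta) > \varepsilon \} \quad \text{and} \quad \Theta_k \setminus \mathbb{S}_\varepsilon = \{ \theta \in \Theta_k \text{ so that } D(\theta) \le \varepsilon \}.$$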



For 9 1 and 02 belonging to S e , we get: 




Hence, on S e , it is sufficient that 



fei fe 2 
f f 



for 



/ 1 



/ 1 



J 0-2 _ 1 
/ 1 



f<>2 
f 



< e. 
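Now, $\mathbb{S}_\varepsilon$ is a parametric class. Since the derivatives of the transfer functions are
bounded and the moment assumption H-3 holds, a function $m(z)$ exists, with $E[m(z)] < \infty$, so that

$$\forall \theta_i \in \{ \beta, a_1, \dots, a_k, w_{10}, \dots, w_{1d}, \dots, w_{kd} \} : \quad \left| \frac{\partial \frac{f_\theta}{f}}{\partial \theta_i}(z) \right| \le m(z).$$

According to Example 19.7 of van der Vaart [8], there exists a constant $K$ so
that the bracketing number of $\mathbb{S}_\varepsilon$ is lower than

$$K \left( \frac{\operatorname{diam} \Theta_k}{\varepsilon^2} \right)^{k \times (d+2)+1} = K \left( \frac{\sqrt{\operatorname{diam} \Theta_k}}{\varepsilon} \right)^{k \times (2d+4)+2},$$

where $\operatorname{diam} \Theta_k$ is the diameter of the smallest sphere of $\mathbb{R}^{k \times (d+2)+1}$ including $\Theta_k$.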

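For $\theta$ belonging to $\Theta_k \setminus \mathbb{S}_\varepsilon$, $\frac{f_\theta}{f} - 1$ is the sum of a linear combination of

$$V(z) := \Big( e(z),\ \left( e(z) x_k x_l \phi''(w_i^{0T} x) \right)_{1 \le l \le k \le d,\ 1 \le i \le k^0},\ \left( e(z) \phi''(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( e(z) x_k \phi'(w_i^{0T} x) \right)_{1 \le k \le d,\ 1 \le i \le k^0},\ \left( e(z) \phi'(w_i^{0T} x) \right)_{1 \le i \le k^0},\ \left( e(z) \phi(w_i^{0T} x) \right)_{1 \le i \le k^0} \Big)$$

and of a term whose $L^2(f\lambda_{d+1})$ norm is negligible compared to the $L^2(f\lambda_{d+1})$
norm of this combination when $\varepsilon$ goes to 0. By assumption H-4, a strictly
positive number $m$ exists so that for any vector of norm 1 with components

$$C = \left( c, c_1, \dots, c_{k^0 \times \frac{d(d+1)}{2}}, d_1, \dots, d_{k^0}, e_1, \dots, e_{k^0 \times d}, f_1, \dots, f_{k^0}, g_1, \dots, g_{k^0} \right)$$

and $\varepsilon$ sufficiently small:

$$\| C^T V(z) \|_2 \ge m + \varepsilon.$$

Since any function $\frac{\frac{f_\theta}{f} - 1}{\left\| \frac{f_\theta}{f} - 1 \right\|_2}$ with $\theta \in \Theta_k \setminus \mathbb{S}_\varepsilon$ can be written:

$$\frac{C^T V(z) + o(\| C^T V(z) \|_2)}{\left\| C^T V(z) + o(\| C^T V(z) \|_2) \right\|_2},$$

the corresponding class of score functions belongs to the set of functions:

$$\left\{ D^T V(z) + o(1),\ \|D\|_2 \le \frac{1}{m} \right\} \subset \left\{ D^T V(z) + \gamma,\ \|D\|_2 \le \frac{1}{m},\ |\gamma| < 1 \right\}$$

whose bracketing number is smaller than or equal to $O\left( \left( \frac{1}{\varepsilon} \right)^{k^0 \times \left( \frac{d(d+1)}{2} + d + 3 \right) + 2} \right)$.

This proves that the bracketing number of $\mathbb{S}$ is polynomial, hence $\mathbb{S}$ is a
Donsker class.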
7.3 Asymptotic index set.

Since the class of generalized score functions $\mathbb{S}$ is a Donsker class, the theorem
follows from Theorem 3.1 of Gassiat [6] or Theorem 3.1 of Liu and Shao [7].
Following these authors, the set of limit score functions $\mathbb{F}^k$ is defined as the set
of functions $d$ so that one can find a sequence $g_n := f_{\theta_n}$, $\theta_n \in \Theta_k$, satisfying
$\left\| \frac{g_n - f}{f} \right\|_2 \to 0$ and $\| d - s_{g_n} \|_2 \to 0$, where $s_{g_n} = \frac{\frac{g_n}{f} - 1}{\left\| \frac{g_n}{f} - 1 \right\|_2}$. Note that, for a
particular sequence of maximum likelihood estimators $(\hat\theta_n)_{n \in \mathbb{N}}$, the partition
of the indices can depend on $n$, but $(\hat\theta_n)_{n \in \mathbb{N}}$ will be the union of
sub-sequences whose score functions converge to elements of the set of limit score functions.

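Let us define the two principal behaviors of the sequences $g_n$ which influence
the form of the functions $d$:

• If the second-order term is negligible compared to the first one:

$$\frac{f_{\theta_n}}{f}(z) - 1 = (\Phi_t^n - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t^n)}(z) + o(D(\Phi_t^n, \psi_t^n)).$$

• If the second-order term is not negligible compared to the first one:

$$\frac{f_{\theta_n}}{f}(z) - 1 = (\Phi_t^n - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t^n)}(z) + 0.5\, (\Phi_t^n - \Phi_t^0)^T f''_{(\Phi_t^0, \psi_t^n)}(z) (\Phi_t^n - \Phi_t^0) + o(D(\Phi_t^n, \psi_t^n)).$$

In the first case, each sequence $g_n$ is the finite union of convergent subsequences
$g_{k(n)}$ and for each subsequence a set $t = (t_0, \dots, t_{k^0}) \in \mathbb{N}^{k^0+1}$ (with $0 = t_0 <
t_1 < \dots < t_{k^0} \le k$) exists so that the limit functions $d$ of $s_{g_{k(n)}}$ will be in:

$$\mathbb{D}_t^k = \Big\{ \Omega\Big( \gamma e(z) + \sum_{i=1}^{k^0} \epsilon_i e(z) \phi(w_i^{0T} x) + \sum_{i=1}^{k^0} e(z) \phi'(w_i^{0T} x) \zeta_i^T x \Big),\ \gamma, \epsilon_1, \dots, \epsilon_{k^0} \in \mathbb{R};\ \zeta_1, \dots, \zeta_{k^0} \in \mathbb{R}^{d+1} \Big\}.$$

In the second case, each sequence $g_n$ is the finite union of convergent
subsequences $g_{k(n)}$ and for each subsequence, an index $i$ exists so that:

$$\sum_{j=t_{i-1}+1}^{t_i} q_j (w_j - w_i^0) = 0,$$

otherwise the second-order term will be negligible compared to the first one, so

$$\sum_{j=t_{i-1}+1}^{t_i} \sqrt{q_j} \times \sqrt{q_j} (w_j - w_i^0) = 0.$$

Hence, a set $t = (t_0, \dots, t_{k^0}) \in \mathbb{N}^{k^0+1}$ exists, with $0 = t_0 < t_1 < \dots < t_{k^0} \le
k$, so that the set of functions $d$ will be:

$$\begin{aligned}
\Big\{ \Omega\Big( & \gamma e(z) + \sum_{i=1}^{k^0} \epsilon_i e(z) \phi(w_i^{0T} x) + \sum_{i=1}^{k^0} e(z) \phi'(w_i^{0T} x) \zeta_i^T x + \sum_{j=t_{k^0}+1}^{k} \alpha_j e(z) \phi(w_j^T x) \\
& + \sum_{i=1}^{k^0} e(z)\, \mathrm{sg}(a_i^0)\, \phi''(w_i^{0T} x) \Big( \delta(i) \sum_{j=t_{i-1}+1}^{t_i} \nu_j^T x x^T \nu_j \Big) \Big) : \\
& \gamma, \epsilon_1, \dots, \epsilon_{k^0}, \alpha_{t_{k^0}+1}, \dots, \alpha_k \in \mathbb{R};\ \zeta_1, \dots, \zeta_{k^0} \in \mathbb{R}^{d+1};\ \nu_1, \dots, \nu_{t_{k^0}} \in \mathbb{R}^d \Big\} \supset \mathbb{D}_t^k,
\end{aligned}$$

where $\delta(i) = 1$ if a vector $q$ exists with $\sum_{j=t_{i-1}+1}^{t_i} q_j = 1$ and $\sum_{j=t_{i-1}+1}^{t_i} \sqrt{q_j}\, \nu_j =
0$, otherwise $\delta(i) = 0$.

So, the limit functions $d$ will belong to $\mathbb{F}^k$.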

Conversely, for x € L 2 (Xd+i ), let d be an element of F fe : 

d = n ( 7 e(*) + eEo ^(^K° t *) + eEo e{z)<j>\vfx)^ * 



+ 



As the functions $d$ belong to the Hilbert sphere, one of their components is not
equal to 0. Let us assume that this component is $\gamma$, but the proof would be
similar with any other component. The norm of $d$ is 1, so any component of $d$
is determined by the ratio: -^0 • 

Then, we can chose 6>£ = (/?", a?, • • • , a£, toft, • • • , u;^, • • • , < d ) so that: 

Vi€{l,...,fc°} : ^t™^, 

Vie{i,...,fc°} : E^_ 1+1 ^K"--°)™ 7 0, 

Vj € {1, ■ ■ ■ , : « - ™ ^, 

since $\Theta_k$ contains a neighborhood of the parameters realizing the true regression
function $F_{\theta^0}$. ∎

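7.4 The derivatives of the LR statistic
7.4.1 Calculation of $(\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z)$

In the sequel, we write $x := (1, x_1, \dots, x_d)^T$.

To get $(\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z)$, we compute the derivatives of $\frac{f_\theta}{f}(z)$ with
respect to the parameters $\beta$, $s_i$, $w_j$ and $a_j$ of $\Phi_t$.

Let us recall that $e(z) := \frac{1}{\sigma^2} \left( y - \left( \beta^0 + \sum_{i=1}^{k^0} a_i^0 \phi(w_i^{0T} x) \right) \right)$. We get:

$$\frac{\partial \frac{f_\theta}{f}(z)}{\partial \beta}(\Phi_t^0) = e(z),$$

$$\frac{\partial \frac{f_\theta}{f}(z)}{\partial s_i}(\Phi_t^0) = e(z) \sum_{j=t_{i-1}+1}^{t_i} q_j \phi(w_i^{0T} x) = e(z) \phi(w_i^{0T} x).$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$, let us write $\frac{\partial \frac{f_\theta}{f}(z)}{\partial w_j} := \left( \frac{\partial \frac{f_\theta}{f}(z)}{\partial w_{j0}}, \dots, \frac{\partial \frac{f_\theta}{f}(z)}{\partial w_{jd}} \right)^T$:

$$\frac{\partial \frac{f_\theta}{f}(z)}{\partial w_j}(\Phi_t^0) = e(z)\, q_j a_i^0 \phi'(w_i^{0T} x)\, x.$$

For $j \in \{t_{k^0}+1, \dots, k\}$:

$$\frac{\partial \frac{f_\theta}{f}(z)}{\partial a_j}(\Phi_t^0) = e(z) \phi(w_j^T x).$$

These equations yield the expression of $(\Phi_t - \Phi_t^0)^T f'_{(\Phi_t^0, \psi_t)}(z)$.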

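7.4.2 Calculation of $(\Phi_t - \Phi_t^0)^T f''_{(\Phi_t^0, \psi_t)}(z) (\Phi_t - \Phi_t^0)$

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta^2}(\Phi_t^0) = e^2(z) - 1$$

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial s_i}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_i^{0T} x)$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$, let us write $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial w_j} := \left( \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial w_{j0}}, \dots, \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial w_{jd}} \right)^T$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial w_j}(\Phi_t^0) = (e^2(z) - 1)\, q_j a_i^0 \phi'(w_i^{0T} x)\, x$$

For $j \in \{t_{k^0}+1, \dots, k\}$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial \beta\, \partial a_j}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_j^T x)$$

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial s_{i'}}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_i^{0T} x) \phi(w_{i'}^{0T} x)$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$, let us write $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial w_j} := \left( \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial w_{j0}}, \dots, \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial w_{jd}} \right)^T$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial w_j}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_i^{0T} x)\, q_j a_i^0 \phi'(w_i^{0T} x)\, x + e(z)\, q_j \phi'(w_i^{0T} x)\, x$$

For $j \in \{t_{i'-1}+1, \dots, t_{i'}\}$ with $i \ne i'$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial w_j}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_i^{0T} x)\, q_j a_{i'}^0 \phi'(w_{i'}^{0T} x)\, x$$

For $j \in \{t_{k^0}+1, \dots, k\}$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial s_i\, \partial a_j}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_i^{0T} x) \phi(w_j^T x)$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$ and $j' \in \{t_{i'-1}+1, \dots, t_{i'}\}$, $j \ne j'$, let us write $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j\, \partial w_{j'}}$ for the
$(d+1) \times (d+1)$ matrix with entries $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_{jm}\, \partial w_{j'm'}}$, $0 \le m, m' \le d$.
We get

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j\, \partial w_{j'}}(\Phi_t^0) = (e^2(z) - 1)\, q_j q_{j'}\, a_i^0 \phi'(w_i^{0T} x)\, a_{i'}^0 \phi'(w_{i'}^{0T} x)\, x x^T$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$, let us write $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j^2}$ for the
$(d+1) \times (d+1)$ matrix with entries $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_{jm}\, \partial w_{jm'}}$, $0 \le m, m' \le d$.
We get

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j^2}(\Phi_t^0) = (e^2(z) - 1) \left( q_j a_i^0 \phi'(w_i^{0T} x) \right)^2 x x^T + e(z)\, a_i^0 q_j \phi''(w_i^{0T} x)\, x x^T$$

For $j \in \{t_{i-1}+1, \dots, t_i\}$ and $l \in \{t_{k^0}+1, \dots, k\}$, let us write $\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j\, \partial a_l} := \left( \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_{j0}\, \partial a_l}, \dots, \frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_{jd}\, \partial a_l} \right)^T$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial w_j\, \partial a_l}(\Phi_t^0) = (e^2(z) - 1)\, q_j a_i^0 \phi'(w_i^{0T} x) \phi(w_l^T x)\, x$$

For $j, l \in \{t_{k^0}+1, \dots, k\}$:

$$\frac{\partial^2 \frac{f_\theta}{f}(z)}{\partial a_j\, \partial a_l}(\Phi_t^0) = (e^2(z) - 1)\, \phi(w_j^T x) \phi(w_l^T x)$$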
Now, the terms: 



1 e*(z) 



and 



H H ~ w i) T X Si (j) {wf x) 

i=l J=t 4 _l+1 

will be negligible compared to the first order term ($ t — $t) T f^o ^ t )( z ) when 
$ 4 — > $j, even if the term of first order is of the same order or negligible 
compared to the terms of the second order: 

/ k° u 

e ( Z ) X X] £ ^J'K' - ^) T XX T (Wj - v%)c%<l>\wf X) 
\i=l j=t 4 _i+l 

So, the development will be valid if for <3> t ^ z exists so that 
e(z)x(f3-(3° + j:Z lSi( t>(wfx) 

E-Ii E*= ts _ 1+1 - - w9)a^"(wfx)) ± 0. 

This inequality is guaranteed by assumption H-4.

These equations yield the expression of $(\Phi_t - \Phi_t^0)^T f''_{(\Phi_t^0, \psi_t)}(z) (\Phi_t - \Phi_t^0)$.